Abstract

The finding that trying, and failing, to predict the
upcoming to-be-remembered response to a given cue can
enhance later recall of that response, relative to studying the
intact cue-response pair, is surprising, especially given that
the standard paradigm (e.g., Kornell, Hays, & Bjork, 2009)
involves allocating what would otherwise be study time to 
generating an error. In three experiments, we sought to elim- 
inate two potential heuristics that participants might use to aid 
recall of correct responses on the final test and to explore the 
effects of interference both at an immediate and at a delayed 
test. In Experiment 1, by intermixing strongly associated to- 
be-remembered pairs with weakly associated pairs, we elim- 
inated a potential heuristic participants can use on the final test 
in the standard version of the paradigm, namely, that really
strong associates are incorrect responses. In Experiment 2, by 
rigging half of the participants' responses to be correct, we 
eliminated another potential heuristic — namely, that one's 
initial guesses are virtually always wrong. In Experiment 3, 
we examined whether participants' ability to remember — and 
discriminate between — their incorrect guesses and correct 
responses would be lost after a 48-h delay, when source 
memory should be reduced. Across all experiments, we continued
to find a robust benefit of trying to guess to-be-learned
responses, even when incorrect, versus studying intact
cue-response pairs. The benefits of making incorrect guesses are
not an artifact of the paradigm, nor are they limited to short 
retention intervals.

An abundance of research on testing and generation effects
has shown that the act of retrieval is a learning event — 
and often a powerful learning event — in the sense that 
the retrieved information becomes more retrievable in 
the future than it would have been otherwise (see, e.g., 
Roediger & Karpicke, 2006). The retrieval processes 
triggered by testing are, therefore, opportunities for 
learning — a basic fact about human learning that is 
often not appreciated or, at least, is underappreciated, 
by students (see, e.g., Karpicke, Butler, & Roediger, 
2009; Kornell & Bjork, 2007). 

Testing effects and generation effects, however, typically 
refer to the consequences of successful retrieval or generation.
One justifiable concern about testing or generation is that what
is retrieved, whether correct or incorrect, will be learned: That
is, by virtue of the very power of retrieval as a learning event, 
it seems likely that any errors that are produced will persist. 
One influential school of thought, for example, inspired by 
Skinnerian principles of learning, has emphasized "errorless 
learning" procedures (Skinner, 1958; Terrace, 1963), and a 
number of studies have, in fact, shown that initially incorrect 
responses often persist on subsequent tests (e.g., Cunningham 
& Anderson, 1968; Elley, 1966; Kaess & Zeaman, 1960; 
Marsh, Roediger, Bjork, & Bjork, 2007). Additionally, gener- 
ating errors before being given feedback mirrors a classic A- 
B/A-D interference paradigm (e.g., Briggs, 1954), in which 
researchers have found that participants do, indeed, become 
more likely to output the initial "B" response as the retention 
interval increases. 

The picture, though, is not so clear. Other studies investi- 
gating the effects of errors on multiple-choice tests (e.g., 
Butler, Marsh, Goode, & Roediger, 2006), for example, have 
shown no effect of generating errors, and other recent — and 
not so recent — findings suggest that there might, in fact, be 
benefits of trying to generate a correct response, even when 
the effort fails.

That even failed efforts to generate a to-be-remembered
response might have benefits is suggested by the results of 
early research by Slamecka and Fevreiski (1983). Participants 
were presented with a list of related cue-target word pairs and 
were asked to say the target word aloud. In a study-only 
condition, the participants were shown the intact pair (e.g., 
pursue-avoid); in a generate condition, participants were 
shown the full cue word together with a fragment of the target 
word (e.g., pursue-av-d). If they failed to generate the
target word within a 4-s interval, they were provided 
with corrective feedback immediately for 3 s. On a 
subsequent free recall test of the targets, there was a 
benefit of generate over study-only, even when only 
those items for which participants failed to generate 
the correct response were examined. The authors argued 
that failed generations were, in fact, incomplete genera- 
tions, where semantic features, but not surface features, 
were processed. 

In Slamecka and Fevreiski's (1983) study, however, 93 % 
of the errors were errors of omission, not errors of commis- 
sion, so their findings leave open the possibility that generat- 
ing overt errors has negative, not positive, effects. Recently, 
though, Kornell, Hays, and Bjork (2009), using a procedure in
which participants' guesses of to-be-learned responses are 
wrong with high probability (thus, eliminating differences 
between items in the errorful and errorless conditions — a 
confound in some previous studies), extended the finding to 
cases where participants do not simply omit responses, but 
produce errors. Their results suggest that producing errors, at 
least under some circumstances, enhances subsequent 
learning. 

Kornell et al.'s (2009) findings have stirred considerable
interest, not only because producing incorrect guesses does
not seem, intuitively, to be a good learning technique, but also
because their specific procedure involved taking what would
otherwise be study time to predict (erroneously) an upcoming
to-be-learned response. In the guess-first condition of their
Experiment 4, for example, participants were shown cues such
as Whale: ____ for 8 s and were asked to predict the
upcoming to-be-learned associate of that cue. Immediately
after, they were then shown the cue together with the
to-be-learned response (Whale: Mammal) for 5 s (97 % of the
guesses were incorrect, and the trials on which guesses 
matched the target were removed from analyses). In 
their study-only condition, on the other hand, pairs such 
as Whale: Mammal were shown for the full 13 s. The 
guess-first condition produced better later recall of the 
correct target than did the study-only condition, despite 
the shorter study time and the reasonable expectation 
that generating a competing associate would create proactive
interference. Kornell et al.'s basic finding has
now been replicated by a number of other investigators
(Grimaldi & Karpicke, 2012; Hays, Kornell, & Bjork,
2013; Huelser & Metcalfe, 2012; Knight, Ball, Brewer,
DeWitt, & Marsh, 2012; Vaughn & Rawson, 2012) and has
been extended to foreign language learning (Potts & Shanks,
2014), more semantically rich text passages
(Richland, Kornell, & Kao, 2009), and trivia facts
(Kornell, 2014).

Questions and issues motivating the present research 

Why do we not find interference in these experimental paradigms?
In the present series of experiments, we seek to address
two issues: (1) whether the design of the experimental paradigm
allows participants to distinguish between their guess and the
correct answer at the time of the final test, and (2) whether the
guess-first benefit will be maintained, or whether the generated
guesses will instead interfere with target recall, at a longer
retention interval.

One explanation of the benefits of guessing incorrectly is 
that a participant's incorrect guess acts as a mediator between 
the cue and the correct response. An assumption that underlies 
this explanation is that learners have a means of knowing, at 
the time of the final test, which response — the one they 
generated or the one they then studied — is the correct 
response. 

In Experiment 1, we set out to examine whether a feature
intrinsic to Kornell et al.'s (2009) paradigm might play a key
role in learners being able to make that judgment. Because
Kornell et al. wanted to examine whether making incorrect
guesses would help or hinder learning, they chose weak
associates of the cue word as to-be-learned response targets,
that is, words that were unlikely to come to mind and
be guessed in advance by participants. In Experiment 1, we
explored whether participants in prior experiments may have
been able to use the fact that generated errors tended to be
strong associates of the cue words, whereas target responses
were always weak associates of a given cue. Could participants
have mitigated interference at the final test between
competing responses, generated errors and targets, by learning
that targets are weak associates? We nullified that possible
heuristic in Experiment 1 by designing the materials so that
the correct answer for half the pairs was a strong associate of
the cue word.

In Experiment 2, we sought to nullify another possible 
heuristic that participants could be using in this paradigm: that 
their guesses are always wrong. The errorful generation paradigm,
as used by Kornell et al. and in subsequent follow-up
studies (Grimaldi & Karpicke, 2012; Huelser & Metcalfe,
2012; Knight et al., 2012; Potts & Shanks, 2014; Vaughn &
Rawson, 2012), ensures that the guess is almost always
wrong, leaving open the possibility that when presented with 
the cue at final test, participants are able to simply select 
whatever response they did not generate for themselves. 
Therefore, we rigged Experiment 2 so that, in one condition,
half of participants' guesses were always deemed to be correct,
and compared the benefit of making errors in this condition
with the original condition where just about all the
guesses were incorrect. 

Variations of the original paradigm have been investigated 
to test different theories as to why there is a benefit of gener- 
ating incorrect guesses, and these theories are further 
discussed in the General Discussion section. Despite varia- 
tions on this original design, however, whether participants 
could use a heuristic remains an open question. One variant 
(e.g., Grimaldi & Karpicke, 2012; Hays, Kornell, & Bjork,
2013) found that delaying feedback of the correct answer 
eliminates the benefit of making incorrect guesses. While 
one explanation is that delaying feedback means that the 
correct target is not encoded into an activated semantic net- 
work, it could also be that having first generated guesses to all 
the guess-first word pairs before receiving the correct answer 
makes it more difficult for learners to recognize that all the 
correct responses are less obvious associates of the cue or even 
that all their initial guesses are wrong. Another variant on the 
original design showed that the benefit of generating re- 
sponses was eliminated when participants' guesses were 
constrained to a particular word (through the provision of a
two-letter stem, e.g., tide-wa__; Grimaldi & Karpicke,
2012). By constraining the guess to one obvious target response,
the experimenters created a very different task than is
experienced by participants making unconstrained anticipato- 
ry "guesses." Instead of interpreting the constrained genera- 
tions as "wrong answers," participants may simply interpret 
them as other correct answers that are simply not required on 
the later test. 

Experiment 3 was designed to examine whether the ability 
of participants to discriminate at the time of test between the 
response they guessed and the actual correct responses de- 
pends on the retention interval to the final testing being 
relatively short. Prior studies have used very short retention 
intervals in which participants are readily able to retrieve their 
initial guesses and to distinguish their guesses from the correct 
targets (e.g., Knight et al., 2012; Vaughn & Rawson, 2012). A
question that remains, however, is whether incorrect guesses 
might become interfering at a long delay. At a delay, we 
expect that participants will display weaker episodic discrim- 
ination and a relatively stronger memory trace for generated 
guesses, as compared with studied targets. The combination of 
these two factors could create a case where generated 
guesses proactively interfere with access to the correct 
targets. If there is indeed no benefit (or even a detri- 
ment) to making guesses at long delays, this finding 
would have implications for applications of generating 
errors in education. In Experiment 3, therefore, we 
investigate whether making erroneous guesses starts to 
interfere after a longer retention interval (48 h). 



Experiment 1 

In Experiment 1, we replicated Kornell et al.'s (2009) 
Experiment 4, but with two changes. First, as was mentioned 
above, to nullify participants being able to use relative asso- 
ciative strength as a discriminative cue at the time of the final 
test, we made half of the to-be-learned responses strong, rather 
than weak, associates of the cue words. If the advantage of 
guessing before study is due to the use of a "the-answer-is- 
always-weakly-associated" heuristic at the final test and 
mixing high associates with the low associates prevents the 
usage of this strategy, the benefit of guessing-first over only 
studying on the final test should be eliminated. Second, in 
addition to asking participants to recall the correct targets on 
the final cued-recall test, we also asked them to recall their 
initial guesses. We reasoned that if guesses competed and 
interfered with the ability to retrieve the targets, we should 
see better recall of targets when participants are unable to 
recall their incorrect guesses. 

Method 

Participants and design 

Thirty-four undergraduates from the University of California,
Los Angeles (UCLA) participated in Experiment 1. The participants
received partial course credit as compensation. We
manipulated study condition (guess-first vs. study-only) and
word-pair association strength (strong vs. weak) within subjects.
In the cued-recall test phase, participants were asked,
in response to each cue word, to recall the correct target and
then the target they had guessed during the study phase prior
to seeing the correct response.

Materials and apparatus 

Sixty paired associates were used. Half were weakly associ- 
ated word pairs with forward association strength between 
0.05 and 0.054 (e.g., Olive: Branch); half were strongly 
associated word pairs with a forward association strength 
between 0.3 and 0.4 (e.g., Table: Chair). The weak associates
were a randomly selected set of 30 pairs, taken from the
materials of Kornell et al. (2009). All the words were, at
minimum, four letters long. Half of the word associates were 
randomly assigned to the guess-first condition, which com- 
prised 15 strong associates and 15 weak associates, and the 
remaining 30 were assigned to the study-only condition. 
Assignment of these two sets of 30 word pairs was 
counterbalanced across participants. The order in which the 
four within-subjects conditions (strong vs. weak; guess-first 
vs. study-only) appeared was block randomized; the list was 
divided into 15 blocks of four trials, where each block 
consisted of one pair from each within-subjects condition 
(therefore controlling for serial position effects between the
conditions). From the participants' point of view, however, 
they saw only one long list of 60 word pairs. Finally, the order 
of the word pairs was fully randomized. 
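
The block-randomized ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Collector implementation: the function name `block_randomize`, the placeholder pair labels, and the fixed seed are all hypothetical.

```python
import random

# The four within-subjects cells (study condition x association strength).
CONDITIONS = [
    ("guess-first", "strong"),
    ("guess-first", "weak"),
    ("study-only", "strong"),
    ("study-only", "weak"),
]

def block_randomize(pairs_by_condition, n_blocks=15, seed=None):
    """Build a study list of n_blocks blocks, each holding exactly one
    pair from every condition, in a freshly shuffled order."""
    rng = random.Random(seed)
    # Shuffle each condition's pairs so that block membership is random.
    pools = {c: rng.sample(p, len(p)) for c, p in pairs_by_condition.items()}
    study_list = []
    for b in range(n_blocks):
        block = [(cond, pools[cond][b]) for cond in CONDITIONS]
        rng.shuffle(block)  # randomize order within the block
        study_list.extend(block)
    return study_list

# Hypothetical placeholder labels: 15 pairs per cell, 60 in total.
pairs = {c: [f"{c[0]}/{c[1]}/pair{i}" for i in range(15)] for c in CONDITIONS}
order = block_randomize(pairs, seed=1)
```

Because every consecutive run of four trials contains one pair from each cell, no condition can cluster at the start or end of the 60-item list, which is the serial-position control the text describes.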

The experiment was created using Collector
(https://github.com/gikeymarcia/Collector), an open-source PHP-based
program designed to run psychology experiments, and was conducted
via an Internet browser. Participants came into the laboratory
and were administered the study on 21.5-in. Apple iMac
desktop computers. The web browser was opened full-screen, 
and instructions and word pairs were all presented in the 
center of the screen. 

Procedure 

The study was composed of two phases: a study phase and a 
final cued-recall test phase. For the study phase, participants 
were told that they would study pairs of related words. 
Sometimes they would see complete pairs, whereas other 
times the second word would be missing. When pairs were 
shown incomplete, participants were told that they should try 
to guess the upcoming to-be-learned response, after which
they would be shown the correct answer. Participants were
shown the 60 word pairs one at a time. In the guess-first
condition, they were presented with a cue and a blank (e.g.,
Olive: ____) and were given 8 s to make a guess (e.g., they
might guess "Martini"). Participants were instructed to always
make a guess, rather than to leave the space blank. The full
cue-target pair (e.g., Olive: Branch) was then shown for 5 s
immediately after they made their guess. In the study-only
condition, participants were presented with the full cue-target pair
twice consecutively, for 8 and 5 s, respectively. 

After a 5-min retention interval, participants were then 
given a final cued-recall test on all 60 word pairs. During the 
final cued-recall test, participants were shown a given cue 
twice followed by a blank line each time. Participants were 
informed that every cue word would be presented twice con- 
secutively and were instructed to fill in the first blank with the 
correct target. For example, if they were presented with the
cue "Olive: ____," they should type "Branch" (the correct
target) the first time they see "Olive: ____" in the final test.
They were instructed that for the second blank, they should
type in their original guesses, if the pair was in the guess-first
condition. In the example given, then, that means that they
should type in "Martini" for the immediately subsequent,
second presentation of "Olive: ____." If they had not been
asked to make a guess for the cue word in the study phase (i.e.,
the pair had been in the study-only condition), participants
were told to type in "Read" instead of an initial guess. It was
not indicated during the final test whether the pair was in the
guess-first or the study-only condition, and the second blank
appeared regardless of whether participants were able to fill in
the first blank (i.e., recall the correct target). Participants were
not given any explicit instruction about whether they should
always fill in the blank, and many left the space blank if they 
could not recall the answers. The pairs were presented in a 
randomized order, and the test was self-paced. 

Results and discussion 

Although comparison of the weak and strong associates is not of
primary concern (the strong associates were included simply to
reduce the possible use of a "the-answer-is-always-weakly-associated"
heuristic), we analyzed the strong and weak pairs separately.
Successful guess rates were 4 % for the weak associates
and 9 % for the strong associates for the guess-first pairs during
the study phase. The rates for the weak associates were about as
expected, but the rates for the strong associates were lower than
expected. This lower rate may reflect that the pairs were
intermixed, meaning that participants could learn that the most
obvious associates were only infrequently the correct responses,
leading to a reduced success rate for the strong associates. All
analyses reported in this article are restricted to the items where 
the guess was incorrect. Additionally, responses were counted as 
correct only if they were typed into the appropriate spaces; in 
other words, recalled targets were counted as correct if entered 
into the first blank, but not if entered into the second. 
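
The position-sensitive scoring rule can be illustrated with a short sketch. The function name `score_trial` is hypothetical, and the case- and whitespace-insensitive exact matching is an assumption; the source does not say how spelling variants were handled.

```python
def score_trial(target, guess, first_blank, second_blank):
    """Score one guess-first test trial. A response counts only if it
    was typed into the blank intended for it."""
    def norm(s):
        # Assumed normalization: ignore case and surrounding whitespace.
        return s.strip().lower()

    first, second = norm(first_blank), norm(second_blank)
    target, guess = norm(target), norm(guess)
    return {
        "target_recalled": first == target,     # target in the first blank
        "guess_recalled": second == guess,      # guess in the second blank
        "guess_intrusion": first == guess,      # guess typed as the target
        "target_intrusion": second == target,   # target typed as the guess
    }

# Responses in the intended blanks versus the same responses swapped.
ok = score_trial("Branch", "Martini", "branch", "Martini")
swapped = score_trial("Branch", "Martini", "Martini", "Branch")
```

Under this rule the swapped trial earns no credit for either response and is instead logged as two intrusions, one in each direction.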

Recall of correct targets 

As is shown in Fig. 1, we replicated the basic finding,
namely that the guess-first condition produced better later
recall of the target response than did the study-only condition,
despite the presence of strongly associated to-be-learned pairs.

Furthermore, the benefit of making incorrect guesses was
present for both strongly associated to-be-learned pairs and
weakly associated pairs. A two-way (study condition × association
strength) within-subjects ANOVA showed that there
was a main effect of study condition, F(1, 33) = 45.06,
MSE = .03, p < .001, ηp² = .58: Pairs in the guess-first
condition (M = .79, SD = .11) were recalled significantly better
than the pairs in the study-only condition (M = .59, SD = .19),
t(33) = 6.73, p < .001, Cohen's d = 1.15. There was no
significant effect of association strength, F(1, 33) = 0.04,
MSE = [illegible], p > .05, ηp² = .001.

The study condition × association strength interaction was
marginally significant, F(1, 33) = 4.04, MSE = .02, p = .053,
ηp² = .11. The benefit of making an incorrect guess appears to
have been marginally larger for the strong associates (.81 vs. 
.57 for guess-first and study-only conditions, respectively) 
than for the weak associates (.76 vs. .62), although the benefit 
was significant for both weak and strong associates. 
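
The effect sizes reported for this ANOVA can be cross-checked from the test statistics alone. The sketch below uses the standard identity for partial eta squared, F·df_effect / (F·df_effect + df_error). For the paired comparison it assumes that Cohen's d was computed as t/√n, which the source does not state but which reproduces the reported value.

```python
import math

def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its dfs."""
    return (f * df_effect) / (f * df_effect + df_error)

def cohens_d_z(t, n):
    """Within-subjects d from a paired t and the number of participants
    (assumed formula: d = t / sqrt(n))."""
    return t / math.sqrt(n)

main_effect = round(partial_eta_squared(45.06, 1, 33), 2)  # study condition
interaction = round(partial_eta_squared(4.04, 1, 33), 2)   # condition x strength
paired_d = round(cohens_d_z(6.73, 34), 2)                  # guess-first vs. study-only

print(main_effect, interaction, paired_d)  # 0.58 0.11 1.15
```

All three values agree with those reported in the text for F(1, 33) = 45.06, F(1, 33) = 4.04, and t(33) = 6.73 with 34 participants.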

Whatever the reason for the strongly associated pairs show- 
ing at least as large a benefit of error generation, the key point 
is that the benefits of the guess-first condition found by 
Kornell et al. (2009) and in subsequent research appear
not to be a consequence of participants being able to adopt a 
heuristic at the time of the final test — namely, that the correct 
response is the weaker of the two remembered associates to a 
given cue word. 

Participants' ability to recall their initial guesses 

Participants' ability to recall their initial guesses (M = .79, SD =
.13) and the correct answers in the guess-first condition (M =
.79, SD = .11) did not differ, t(33) = 0.15, p > .05, Cohen's d =
0.026; neither was there a difference in their recall of their
guesses to the strong-associate cues (M = .81, SD = .14) and to
the weak-associate cues (M = .77, SD = .17), t(33) = 1.30, p >
.05, Cohen's d = 0.22. Intrusion rates of guesses into the blank
space provided for targets and vice versa were very low:
Guesses intruded into recall of targets only 1.4 % of the time
(SD = 3.1 %), and targets intruded into recall of guesses only
2.9 % of the time (SD = 4.8 %). Thus, there was no evidence
that initial guesses were suppressed, replicating the findings of
Vaughn and Rawson (2012) and Knight et al. (2012).

For the study-only trials, participants correctly typed
"Read" in the second blank provided for each given cue
78.2 % (SD = 30.4 %) of the time. The large standard deviation
reflects the 6 participants who may have misunderstood
the instructions (3 of whom mostly left the space
blank, and 3 of whom either provided the target a second time
or entered completely new cue-related words).

Target recall conditional on guess recall 

When we examine target recall conditional upon the ability to
recall initial guesses, we see interesting patterns. A 2 (strength:
weak, strong) × 3 (study condition: guess recalled, guess
unrecalled, study only) within-subjects ANOVA revealed a
main effect of study condition, F(2, 58) = 31.6, MSE = .04,
p < .001, ηp² = .52, but no main effect of strength and no
interaction, Fs < 1. Data from 4 participants were not included
in this ANOVA because they had perfectly recalled
their initial guesses to either all the weak-associate or all the
strong-associate pairs, leaving a guess-unrecalled cell empty.
In Fig. 2, we collapse across strength and
compare correct target recall performance of all of the study- 
only items with that of the guess-first items for which guesses 
were also recalled and with the guess-first items for which the 
guesses were not recalled. As is shown in Fig. 2, there is a 
benefit of generating guesses, but only when those initially 
incorrect guesses are later recallable. 

Post hoc t tests showed that while there was a benefit of
making incorrect guesses (M = .85, SD = .11) over pure study
(M = .59, SD = .19) when guesses were retrieved, t(33) = 8.38,
p < .001, Cohen's d = 1.43, there was no difference between
recall of targets of guess-first items when participants could
not recall their guesses (M = .59, SD = .22) and recall of targets
of study-only items, t(33) = 0.01, p > .05, Cohen's d < 0.01.
Additionally, there was a significant difference between recall
of the guess-first targets when guesses were recalled and when
they were not, t(33) = 7.29, p < .001, Cohen's d = 1.25. These
analyses suggest that participants' access to their
guesses also allows for greater accessibility of the targets
and replicate the patterns found in prior studies (Butler, 
Fazio, & Marsh, 2011; Knight et al., 2012; Vaughn & 
Rawson, 2012). 
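
The conditional breakdown used in this analysis can be sketched for a single participant as follows. The function name, the trial dictionary layout, and the toy data are hypothetical and serve only to show how the three cells are formed.

```python
from collections import defaultdict

def conditional_target_recall(trials):
    """Proportion of targets recalled in each of three cells. Each trial
    is a dict with 'condition' ('guess-first' or 'study-only') and
    'target_recalled' (bool); guess-first trials also carry
    'guess_recalled' (bool). A cell with no trials is reported as None."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_recalled, n_trials]
    for t in trials:
        if t["condition"] == "study-only":
            cell = "study-only"
        elif t["guess_recalled"]:
            cell = "guess recalled"
        else:
            cell = "guess unrecalled"
        counts[cell][0] += t["target_recalled"]
        counts[cell][1] += 1
    cells = ("guess recalled", "guess unrecalled", "study-only")
    return {c: (counts[c][0] / counts[c][1] if counts[c][1] else None)
            for c in cells}

# Hypothetical toy data for one participant.
toy = [
    {"condition": "guess-first", "target_recalled": True, "guess_recalled": True},
    {"condition": "guess-first", "target_recalled": True, "guess_recalled": True},
    {"condition": "guess-first", "target_recalled": False, "guess_recalled": False},
    {"condition": "study-only", "target_recalled": True},
    {"condition": "study-only", "target_recalled": False},
]
result = conditional_target_recall(toy)
```

A participant who recalled every guess would have `None` in the guess-unrecalled cell, which is why participants with perfect guess recall cannot contribute to the 2 × 3 ANOVA.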



Experiment 2 

In Experiment 1, despite our expectations, successful guess 
rates for high- and low-associate word pairs were not
dramatically different. Prior research by Koriat, Fiedler, and
Bjork (2006) also suggests that hindsight bias can make it
difficult for participants to accurately judge the likelihood of
generating a target given a cue, particularly when the cue-target
pair is related: When shown a list of word pairs of zero,
low, or high association, participants grossly overestimated 
the percentage of people who would generate the target given 
the cue and showed a remarkable underappreciation of the 
difference between high- and low-associate pairs. If, in hind- 
sight, participants are unable to judge which word is a stronger 
associate to the cue word, then Experiment 1 might not have 
worked to fully address the heuristic that the correct target is 
always a weak associate. 

In an attempt to address these concerns, we conducted an
experiment similar to Koriat et al.'s (2006) study. We presented
33 participants with the word pairs and asked them, first, to
judge the number of people out of 100 who would generate
the target given the cue and then to categorize half of the pairs
as "strong" and the other half as "weak" associates. As with
Koriat et al., participants greatly overestimated the likelihood
of generating the target given the cue for both strong (M =
60 %, SD = 12 %) and weak (M = 47 %, SD = 13 %) associate
pairs, and they miscategorized 39 % (SD = 5 %) of the pairs as
strong or weak. These findings suggest, therefore, that the
subjective experience of participants in Experiment 1 did not 
differ as markedly as we expected for strongly and weakly 
associated pairs. 

Another heuristic that would be easy for participants to use 
at the time of test in the original error generation paradigm is 
that almost every response they generate is incorrect. That is, 
if it is easy to distinguish between their generated response 
and the correct response at the time of test, then the one 
should not interfere with the other. We attempted to 
eliminate the use of this heuristic in Experiment 2 by 
rigging half of participants' responses to be correct. If 
the benefit of guessing first is a result of participants 
using a "my-guess-is-always-wrong" heuristic, then 
mixing correct guesses with the incorrect guesses should 
eliminate this strategy and eliminate the benefit or, at 
least, reduce the size of the benefit, as compared with 
when guesses are always incorrect. 

Method 

Participants and design 

Fifty-nine participants were recruited from Amazon 
Mechanical Turk and were paid $1.50 for their participation.
As in Experiment 1, study condition (study-only vs.
guess-first) was manipulated within subjects. In
Experiment 2, however, we also manipulated the presence
of correct guesses (all-incorrect, n = 27, vs. half-incorrect,
n = 32) between subjects. For those in the
half-incorrect condition, whatever guess the participant 
generated was deemed to be the "correct answer" for 
half of the guess-first word pairs. 



Materials and procedure 

The procedure of Experiment 2 was the same as that in 
Experiment 1, with three exceptions: First, instead of the 
combination of strong and weak associate pairs used in 
Experiment 1, we used the original 60 weak associate pairs 
used in Kornell et al. (2009). Second, for each individual, the
word pairs were randomly assigned into one of three groups of 
20 word pairs: study-only, guess-first, or filler. The two former 
conditions matched the study-only and guess-first conditions 
in Experiment 1. Filler words were presented in the same 
manner as guess-first words (i.e., participants spent 8 s gen- 
erating a guess for the target word, given the cue, and 5 s 
studying the "correct" word that goes with the cue). The only 
difference was that in the half-incorrect condition, the "cor- 
rect" word shown was whatever guess the participant had 
generated (i.e., Cue: Guess), while in the all-incorrect
condition, the "correct" word shown was the weakly 
associated target from the Kornell et al. stimuli (i.e., 
Cue: Target). The study phase presentation order of the 
word pairs was block randomized into 10 blocks of six 
word pairs. Each block of six pairs consisted of two 
study-only pairs, two guess-first pairs, and two filler 
pairs. Following the study phase, participants were test- 
ed on all 60 presented pairs in random order. Finally, to 
reduce the complexity of instructions at the test phase, 
participants were tested only on their recall of the 
correct responses; that is, they were not asked to recall 
their original guesses or to identify whether the pair had 
been in the guess-first or study-only condition. 
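
The rigging manipulation reduces to a single rule for which word is displayed after the 8-s guessing period. The sketch below is an illustration only: the function name and string labels are hypothetical, and "Ocean" is an invented guess.

```python
def feedback_word(trial_type, between_condition, guess, target):
    """Word shown as the 'correct' answer after a guess has been made.
    Applies to guess-first and filler trials; study-only trials involve
    no guess."""
    if trial_type == "filler" and between_condition == "half-incorrect":
        return guess   # rigged: the participant's own guess is confirmed
    return target      # otherwise the weakly associated target is shown

# The same filler trial under the two between-subjects conditions.
rigged = feedback_word("filler", "half-incorrect", "Ocean", "Mammal")
standard = feedback_word("filler", "all-incorrect", "Ocean", "Mammal")
```

With 20 guess-first and 20 filler pairs per participant, this rule makes half of the guessed items "correct" in the half-incorrect condition while leaving the guess-first items identical across the two groups.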

Results and discussion 

Successful guess rate in the guess-incorrect condition was 
6.5 %. Those pairs in which the guesses matched the intended 
target in the guess-incorrect condition were eliminated from 
the analyses. 

If the benefit of guessing over study-only in the original 
paradigm was a result of using a guesses-are-always-incorrect 
heuristic, the benefit should be eliminated in the half-incorrect 
condition. A 2 (study-only vs. guess-first) × 2 (all-incorrect vs.
half-incorrect) mixed ANOVA revealed, however, only a
main effect of study condition, F(1, 57) = 31.39, MSE = .02,
p < .001, ηp² = .36. In other words, there was a benefit of
making incorrect guesses (M = .63, SD = .24) over study-only
(M = .50, SD = .25). There was no main effect of the presence
of correct guesses, F(1, 57) = 1.63, MSE = .10, p > .05,
ηp² = .03, although performance in the half-incorrect
condition (M = .60, SD = .22) was numerically higher than
that of the all-incorrect condition (M = .53, SD = .23).
Critically, however, there was no interaction between study
condition and the presence or absence of correct
guesses, F(1, 57) < 1. In other words, guess-first was
significantly better than the study-only condition when all guesses
were incorrect (M = .15, SD = .19) and when half of the
guesses were rigged to be correct (M = .12, SD = .16), and
the magnitude of the benefit did not change depending on the
presence or absence of correct guesses.

Within the all-incorrect condition, performance on the 20
"filler" pairs (M = .61, SD = .23) was, as expected, not
significantly different from performance on the guess-first
pairs, t(26) = 0.61, p > .05, Cohen's d = 0.12. Within the
half-incorrect condition, recall of guesses rigged to be correct
(M = .80, SD = .15) was significantly higher than recall of targets
in the guess-incorrect and study-only conditions, ps < .01. This
benefit of correct guesses is to be expected on the basis of what we
know about the generation effect (Slamecka & Graf, 1978). 
Finally, although we did not ask participants in 
Experiment 2 to distinguish between their correct and 
incorrect guesses, or the guess-correct trials from the 
guess-incorrect trials, it is interesting to note that on 
the guess-incorrect trials, initial guesses intruded into 
the recall of correct answers only 6 % of the time in the 
all-incorrect condition and 8 % in the half-incorrect 
condition. Coupled with the fact that participants were 
able to correctly recall correct guesses 80 % of the 
time, it appears that they are able to distinguish between 
their correct and incorrect guesses, ruling out the "my- 
guess-is-always-wrong" heuristic. 



Experiment 3 

Experiments 1 and 2 eliminated two potential heuristics that 
might help explain why making an incorrect guess can en- 
hance later recall. We also replicated the finding that, at a short 
delay, participants not only are able to recall their original 
guesses very well, but also are able to discriminate between 
the incorrect guesses they had generated and the correct target. 
The conditional analyses in Experiment 1 also suggested that 
participants' ability to recall their original guesses, rather 
than interfering, was related to their ability to recall the 
correct answer. In other words, source memory is very accurate 
with only a 5-min delay between the study and test 
phases. Source memory after a longer retention interval, how- 
ever, may not be as accurate, and an inability to distinguish 
between generated responses and correct responses may lead 
to an overall benefit of study-only over guess-first. 

Method 

Participants 

Twenty-nine undergraduates from UCLA participated in 
Experiment 3 for course credit. 



Design, materials, and procedure 

Experiment 3 was the same as Experiment 1, with the exception 
of two changes: First, instead of a 5-min delay, there was a 
48-h interval between the study phase and the test phase. 
Second, the study was conducted entirely online, instead of 
in the laboratory. Participants were first given a link to com- 
plete the study phase, which was identical to the study phase 
in Experiment 1. Approximately 48 h later, participants were 
asked, via email, to finish the test phase online, recalling both 
the correct targets and their initial guesses, as in Experiment 1. 
Of the participants, 100 % completed the test phase. On 
average, participants took the delayed test 61 h after initial 
study, and there was no significant correlation between the 
length of the delay and either final recall performance or 
recall of initial responses. 

Results and discussion 

Overall, correct guess rates were 4 % for the weak associates 
and 8 % for the strong associates for the guess-first pairs 
during the study phase. These figures are comparable with 
those found in Experiments 1 and 2. Again, all analyses are 
restricted to those items for which initial guesses were 
incorrect. 

Recall of correct targets 

As was expected, overall recall performance was lower after 
this longer delay than it was after the 5-min delay in 
Experiment 1, but the pattern, as shown in Fig. 3, is otherwise 
remarkably similar to the findings in Experiment 1. A two-way 
(study condition × association strength) within-subjects 
ANOVA revealed a main effect of study condition, F(1, 28) = 
12.22, MSE = .02, p < .01, ηp² = .30, no main effect of 
association strength, F(1, 28) = 2.04, MSE = M, p > .05, 
ηp² = .07, and no interaction, F(1, 28) = 0.29, MSE = .01, p > 
.05, ηp² = .01. On average, targets in the guess-first condition 
were recalled 32 % (SD = 16 %) of the time, while targets in 
the study-only condition were recalled 23 % (SD = 14 %) of 
the time, a difference which was significant, t(28) = 3.46, p < 
.05, Cohen's d = 0.64. 
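
The computation of Cohen's d for the paired comparisons is not spelled out in the text. The reported values are, however, consistent with the common within-subjects convention d = t / √n, and the sketch below assumes that convention; the function name is hypothetical.

```python
import math


def cohens_d_from_paired_t(t_value: float, n_participants: int) -> float:
    """Within-subjects effect size, assuming the convention d = t / sqrt(n)."""
    return t_value / math.sqrt(n_participants)


# Guess-first vs. study-only target recall in Experiment 3: t(28) = 3.46, n = 29
d_targets = cohens_d_from_paired_t(3.46, 29)
print(round(d_targets, 2))
```

Under this assumption, the result rounds to the reported d of 0.64.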

Participants' recall of their initial guesses 

Guesses were recalled at a significantly higher rate (M = .44, 
SD = .19) than the targets (M = .32, SD = .16), t(28) = 3.72, p < 
.05, d = .69. Participants were also less able to correctly type 
"Read" into the second prompt for those items that had been in 
the study-only condition (M = .32, SD = .16), with many of the 
spaces left blank (M = .34, SD = .28) or with new cue-related 
words entered (M = .25, SD = .32). It is unclear, however, 
whether these responses for the study-only items reflect 
blurred source memory (i.e., participants could not remember 
whether they had initially generated a guess for those words), 
confusion with respect to the instructions, or, perhaps, a 
combination. 

Fig. 3 Proportion of targets correctly recalled, by study condition and 
association strength, in Experiment 3. Error bars represent standard errors 
of the means 

As one would expect with an increased retention interval 
and, hence, decreased episodic discrimination, intrusion rates 
for the guess-first items were also increased, as compared with 
those of Experiment 1: Initial guesses intruded into recall of 
the targets 12.4 % (SD = 14.7 %) of the time, and targets 
intruded into the recall of the initial guesses 7.5 % (SD = 
12.7 %) of the time. Finally, there was also a marginally 
significant difference in the recall of guesses for the 
strong-associate cue-target pairs (M = .39, SD = .21), as compared 
with the recall of guesses for the weak-associate cue-target 
pairs (M = .48, SD = .23), t(28) = 2.01, p = .054, Cohen's d = 
0.37. This pattern of results may reflect greater interference 
from strong-associate targets, or it may arise because the normative 
association strength of the guesses to the cues was higher 
for the weak-associate pairs than for the strong-associate pairs. 
In support of this speculation, the pattern was reversed for the 
recall of targets: Target recall was higher for the strong 
associates (M = .34, SD = .16) than for the weak associates 
(M = .30, SD = .19). Although the difference in target recall 
was not significant between the strong and weak associates, 
t(28) = 1.28, p > .05, Cohen's d = 0.24, there was a significant 
association strength × response (target vs. guess) interaction, 
F(1, 28) = 11.27, MSE = .012, p < .01, ηp² = .29. 

In sum, with a longer retention interval, guesses, having 
been generated, were better recalled than targets. One possible 
outcome of guesses being stronger is that they then, with a 
delay, become more interfering. Yet, despite this greater po- 
tential for interference, we still found a significant benefit of 
making guesses. 



Target recall conditional upon recall of the guesses 

As in Experiment 1, we examined the likelihood of target 
recall conditional upon guess recall, the results of which are 
represented in Fig. 4. A one-way within-subjects ANOVA 
revealed that there were significant differences among the 
recall of the targets of guess-first items when guesses were 
retrieved, guess-first items when guesses were not retrieved, and 
study-only items, F(2, 56) = 4.05, MSE = .03, p < .05, ηp² = .13. Post hoc 
t tests showed that while there was a benefit of making incorrect 
guesses over pure study (M = .23, SD = .14) when guesses 
were retrieved (M = .36, SD = .24), t(28) = 3.46, p < .01, 
Cohen's d = 0.64, there was no difference between recall of 
targets of guess-first items when guesses could not be retrieved 
(M = .21, SD = .19) and the targets of the study-only 
items, t(28) = 1.01, p > .05, Cohen's d = 0.19. That is, even 
after a 48-h delay, ability to recall guesses is positively associated 
with recall of the correct targets. One possible interpretation 
is through a "mediator" lens: Recalling the guess 
enhanced recall of the target, whereas failure to retrieve one's initial 
guess led to no benefit over pure study. Unlike in 
Experiment 1, however, there was not a significant 
difference between recall of the guess-first targets when 
guesses were recalled (M = .36, SD = .24) and when guesses 
were not recalled (M = .21, SD = .19), t(28) = 1.54, p > .05, 
Cohen's d = 0.29. 
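
To make the conditional analysis concrete, the following sketch computes target recall separately for guess-first trials on which the guess was or was not recalled, and for study-only trials. The trial records are invented for illustration only and are not data from the experiment; the tuple layout and function name are likewise assumptions.

```python
# Hypothetical records for one participant:
# (study condition, guess recalled at test, target recalled at test).
# Study-only trials have no guess, so the second field is None.
trials = [
    ("guess-first", True, True),
    ("guess-first", True, False),
    ("guess-first", True, True),
    ("guess-first", False, False),
    ("guess-first", False, True),
    ("study-only", None, True),
    ("study-only", None, False),
    ("study-only", None, False),
    ("study-only", None, False),
]


def target_recall(records, condition, guess_recalled=None):
    """Proportion of targets recalled for a condition, optionally
    restricted to trials with the given guess-recall status."""
    selected = [
        target
        for cond, guess, target in records
        if cond == condition
        and (guess_recalled is None or guess == guess_recalled)
    ]
    return sum(selected) / len(selected) if selected else float("nan")


print(target_recall(trials, "guess-first", guess_recalled=True))
print(target_recall(trials, "guess-first", guess_recalled=False))
print(target_recall(trials, "study-only"))
```

In an analysis of this kind, the three proportions would be computed per participant and then submitted to the within-subjects ANOVA and follow-up t tests.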

In sum, Experiment 3 showed that even after a long delay 
(of at least 48 h), participants' incorrect guesses did not 
interfere with the recall of the correct targets. 



General discussion 

In Experiments 1 and 2, we ruled out two factors that might 
contribute in an artificial way to the observed benefits of 
making an incorrect guess, versus studying the correct pair. 
When participants could not, at the time of the final test, rely 
on "pick the weaker associate" or "pick the one that I did not 
generate," we still found benefits of making incorrect guesses 
over a study-only condition. Finally, Experiment 3 showed, 
remarkably, that errors did not interfere with the ability to 
recall the correct answers but (counterintuitively) were related 
to greater recall of the correct answer. 

Why guessing incorrectly might enhance later recall: possible 
mechanisms 

How do conditions that should, logically, create proactive 
interference and also reduce the participants' time to study 
the target response actually lead to better recall of a to-be- 
learned response? Several possible mechanisms have been 
proposed. 

Suppression 

One mechanism that might explain, at least partially, the 
benefits of making incorrect guesses is that when corrective 
feedback is provided, the incorrect guesses become inhibited 
or suppressed and, therefore, do not interfere. However, consistent 
with prior research (Knight et al., 2012; Vaughn & 
Rawson, 2012), the results of Experiments 1 and 3 showed that 
participants are readily able to retrieve their initial guesses, 
suggesting that the guesses were not suppressed. 

We might have predicted that making incorrect guesses, 
while beneficial for short-term learning, would proactively 
interfere with recall of correct responses at a longer delay, 
where episodic discrimination between initial guesses and 
correct responses should be degraded. The results of 
Experiment 3, however, show that even after an average of 
61 h, we still find a benefit of making incorrect guesses. 

Mediation 

Another mechanism that has been proposed is that making 
incorrect guesses can, in fact, function as an additional cue, 
aiding the recall of the correct target. Pyc and Rawson (2010) 
demonstrated that when participants are instructed to generate 
mediators, mediator effectiveness (as measured by ability to 
both retrieve and decode mediators during the criterion test) is 
enhanced through testing. Findings by Carpenter (2011) suggest 
that explicit instructions to use mediators may not be 
necessary. Rather, semantic mediators may be covertly gener- 
ated during initial study and, more so, when initial study 
involves testing rather than purely studying. Carpenter found 
that never-presented strong associates to cue words in a study- 
test condition were more likely to be falsely recognized on a 
later recognition test than never-presented strong associates to 
cue words in a study-restudy condition. Furthermore, when 
cued with these strong associates, participants were more 
likely to recall the correct targets in the study-test condition 
than in the study-restudy condition. As they apply to Kornell 
et al.'s (2009) paradigm, these mediation ideas suggest that the 
cue for a given pair, the erroneous response that is generated, 
and the target response are integrated into a kind of triplet that 
then aids recall of the target response at the time of the final 
test. 

Three prior studies (Butler et al., 2011; Knight et al., 
2012; Vaughn & Rawson, 2012) have demonstrated 
that target responses are better recalled when the initial 
guesses are also recalled. In the latter two studies, the same 
cue-target paradigm was used (Butler et al. used general 
knowledge questions), and participants were asked to recall 
their initial guesses first before recalling the targets. In our 
present study, we reversed this order, asking participants to 
provide the targets first before recalling their initial guesses. 
Despite this reversal of output order, our results are similar to 
their findings: Experiments 1 and 3 found that when partici- 
pants were able to recall their initial guesses, they were more 
likely to also recall the correct targets. 

It is not clear, however, whether these studies constitute 
evidence for the mediator hypothesis. While this pattern of 
results would be consistent with the mediator hypothesis, it 
does not necessitate it; this pattern could be the result of item 
selection effects. An alternative account of the results is that 
those trials for which guesses are recalled have simply been 
encoded more deeply and, therefore, both guessed and actual 
targets are more easily recalled. This account would, there- 
fore, not posit that retrieval of the guesses must precede the 
targets (as would be predicted in the strict sense of 'media- 
tion') but, rather, allow for the two to be simply correlated. 

Semantic activation 

Finally, another mechanism proposed by Kornell et al. (2009) 
has gained considerable support: namely, that trying to predict 
an upcoming to-be-learned response requires activating 
the semantic network associated with the cue. The basic idea 
is that this activation then affords a richer encoding of the 
subsequently presented target. That is, the to-be-learned response 
is then encoded in a richer, more elaborated way, in 
relation to the cue, than it would have been had the intact pair 
been shown for study only. 

In support of the semantic activation hypothesis, re- 
searchers have found that the benefit of making incorrect 
guesses is eliminated in cases where semantic activation is 
misguided (e.g., in the case of unrelated word pairs; Grimaldi 
& Karpicke, 2012; Huelser & Metcalfe, 2012; Knight et al., 
2012), when feedback in the guess-first condition is delayed 
(Grimaldi & Karpicke, 2012; Hays et al., 2013; Vaughn & 
Rawson, 2012; but see Kornell, 2014), and when the 
activation is constrained during guess generation (Grimaldi & 
Karpicke, 2012). 

Many other converging studies, using different types of 
learning materials, both in the laboratory and in the classroom, 
also show benefits of incorrect guesses, given that 
the process of generating errors activates semantic networks. 
In the lab, McGillivray and Castel (2010) demonstrated that in 
learning face-age associations, both older and younger adults 
benefit from making a guess as to a face's age before being 
given the answer, even though these guesses were almost 
always incorrect. Importantly, McGillivray and Castel found 
that guessing benefited learning of face-age associations only 
when there was schematic support (i.e., when the to-be- 
learned ages made sense given the cues from the face). 

In Singapore math classrooms, Kapur and Bielaczyc 
(2012) demonstrated the benefit of what they called "productive 
failure." In their study, half of the classes spent six class 
periods trying and failing to solve math problems before 
receiving one period of instruction in which they were given the correct 
answer ("productive failure" condition). Critically, in this one 
period of instruction, teachers not only explained what the 
correct answer was, but also compared and contrasted the 
correct solution to the incorrect solutions. The other half of 
the classes spent all seven periods being taught the correct 
method, practicing questions, doing homework and getting 
feedback ("directed instruction" condition). On a final test, 
those in the productive failure condition performed better than 
those in the directed instruction condition, particularly on 
complex problems and a test of representational flexibility. 

The semantic activation hypothesis cannot be the whole 
story, however, since Potts and Shanks (2014) recently 
showed benefits of anticipating upcoming to-be-learned responses 
in a series of experiments where (relevant) semantic 
activation is impossible: Participants had to guess and learn 
the definitions of rare or obscure English words and Euskara 
words, words for which they had no way of activating relevant 
semantic concepts. Despite the lack of semantic relationship 
between participants' guesses, the cues, and the targets, Potts 
and Shanks found a robust benefit of generating guesses first 
over simply studying the cue-target pairs. 

Concluding comments 

Our results, together with those obtained by other researchers, 
show that activating knowledge before study and testing, even 
when responses are incorrect, can benefit learning. 
Understanding fully the dynamics that offset what would 
seem major costs of generating incorrect guesses, including 
introducing proactive interference and reducing study time, 
awaits further research, but one implication is that instructors 
need not consider difficult tests as inherently 
"risky" to administer. 